Abstract

We study the inherent space requirements of shared storage algorithms in asynchronous fault-prone 
systems. Previous works use codes to achieve a better storage cost than the well-known replication 
approach. However, a closer look reveals that they incur extra costs somewhere else: Some use unbounded
storage in communication links, while others assume bounded concurrency or synchronous
periods. We prove here that this is inherent, and indeed, if there is no bound on the concurrency level,
then the storage cost of any reliable storage algorithm is at least f + 1 times the data size, where f
is the number of tolerated failures. We further present a technique for combining erasure codes with
full replication so as to obtain the best of both. We present a storage algorithm whose storage cost is 
close to the lower bound in the worst case, and adapts to the concurrency level. 



1 Introduction 


We reason about the storage space required for emulating reliable shared storage over fault-prone nodes. 
The traditional approach to building such storage stores full replicas of the data in each node [1]. This 
approach entails a fixed storage cost equal to the size of the data times the number of nodes, regardless 
of the level of concurrency. 

Recently, there is an active area of research of employing codes, and in particular erasure codes, in 
distributed algorithms with the goal of reducing the storage cost [3 0 El El ESI El • But when we look at 
these works closely, we find that in all asynchronous solutions, extra costs are hidden somewhere. Some 
keep an unbounded number of versions [8], or as many as the allowed level of concurrency [6]. Others 
keep unbounded information in channels mm- While others assume periods of synchrony [3] or allow 
returning obsolete values [12]. 

To provide intuition about erasure-coded reliable storage algorithms, we give in Section 3 a simple
space-efficient solution that only guarantees safe semantics [10], which are too weak to be of practical use.
We use this example to illustrate the challenges that have led algorithms that provide stronger semantics 
to store many versions of the coded data. 

Then, in Section 4, we prove that this is inherent: any lock-free algorithm that simulates reliable
storage in an asynchronous system where f storage nodes can fail must sometimes store f + 1 full replicas
of written data, or its storage cost can grow without bound. Specifically, our bound applies to any fault- 
tolerant implementation of a multi-writer multi-reader (MWMR) register that satisfies at least weak 
regularity, a safety notion weaker than linearizability. 

We prove our result for the fault-prone shared memory model mmm in order to avoid reasoning 
explicitly about channels. The same bound applies to message passing systems if we limit the capacity of 
communication channels. For the sake of our proof, we define a specific adversary behavior, which makes 
the proof fairly compact. 

Understanding the inherent storage cost limitation that stems from our lower bound, and in particular, 
the fact that, under high concurrency, nodes have to keep full replicas, leads us to develop an adaptive 
approach that combines the advantages of full replication and coding. We present in Section 5 an
algorithm that simulates an FW-Terminating [1] strongly regular m MWMR register, whose storage 
requirement is close to the storage limitation in the worst case, and uses less storage in runs with low 
concurrency. The algorithm does not assume any a priori bound on concurrency; rather, it uses erasure 
codes when concurrency is low and switches to replication when it is high. 

Finally, we believe that our work is only a first effort to combine erasure coding with replication in 
order to achieve adaptive storage costs. We conclude in Section [^ with some thoughts about directions 
for future work. 

2 Preliminaries 

2.1 Model 

We consider an asynchronous fault-prone shared memory system consisting of a set $N = \{bo_1, \ldots, bo_n\}$
of base objects supporting arbitrary atomic read-modify-write (RMW) access by clients from some finite
set $\Pi$. Any $f$ base objects and any number of clients may fail by crashing, for some predefined $f < n/2$.
We study algorithms that emulate a shared object to a set of clients. 

Clients interact with the emulated object via high-level operations. To distinguish the high-level 
emulated operations from low-level base object access, we refer to the latter as RMWs. We say that 
RMWs are triggered and respond, whereas operations are invoked and return. A (high-level) operation 
consists of a series of trigger and respond actions on base objects, starting with the operation’s invocation 
and ending with its return. In the course of an operation, a client triggers RMWs separately on each 
$bo_i \in N$ and receives responses in return. We model the state of each $bo_i \in N$ as changing, according to
the RMW triggered on it, at some point after the time when the RMW is triggered but no later than the 
time when the matching response occurs. 

An algorithm defines the behavior of clients as deterministic state machines, where state transitions
are associated with actions such as RMW trigger/response. A configuration is a mapping to states from 
system components, i.e., clients and base objects. An initial configuration is one where all components 
are in their initial states. 

A run of algorithm A is a (finite or infinite) alternating sequence of configurations and actions, 
beginning with some initial configuration, such that configuration transitions occur according to A. We 
use the notion of time $t$ during a run $r$ to refer to the configuration incurred after the $t$-th action in $r$. A
run fragment is a contiguous subsequence of a run. 

We say that a base object or client is faulty in a run r if it fails any time in r, and otherwise, it is 
correct. A run is fair if (1) for every RMW triggered by a correct client on a correct base object, there 
is eventually a matching response, (2) every correct client gets infinitely many opportunities to trigger 
RMWs. We again use different terminology to distinguish incomplete invocations to the high-level service 
from incomplete RMWs triggered on base objects and refer to the former as outstanding operations and 
to the latter as pending RMWs. 

Operation $op_i$ precedes operation $op_j$ in a run $r$, denoted $op_i \prec_r op_j$, if $op_i$'s response occurs before
$op_j$'s invocation in $r$. Operations $op_i$ and $op_j$ are concurrent in a run $r$ if neither precedes the other. A run
with no concurrent operations is sequential. 

2.2 Storage service definitions 

We study emulations of an MWMR register, which stores a value v from a domain V, and offers an 
interface for invoking read and write operations. Initially, the register holds some distinguished initial 
value $v_0 \in V$. The sequential specification for this service is as follows: A read returns the latest written
value, or $v_0$ if none was written.

The storage resources consumed by the MWMR register emulations discussed herein are measured in 
units of bits. For constructive algorithmic results, bits are stored in base objects following writes triggered
by clients, and correctness relies upon the existence of a decoding algorithm that can recover $v \in V$ from
the bits available to the reader. The common examples for such decoding algorithms are 1) the trivial
decoder mapping $D = \log_2 |V|$ bits to the value $v$ using the standard binary representation, as in the
case of replication; and 2) an erasure-code decoder mapping a set of $D$ or more code bits to $v$. For
the impossibility proof we use a fundamental information theoretic argument that any representation,
either coded or uncoded, cannot guarantee to recover $v$ precisely from fewer than $D = \log_2 |V|$ bits. This
argument excludes common storage-reduction techniques like compression and de-duplication, which only 
work in probabilistic setups and with assumptions on the written data. 

We now proceed to detail the properties describing the MWMR register. 

Liveness There is a range of possible liveness conditions, which need to be satisfied in fair runs 
of a storage algorithm. A wait-free object is one that guarantees that every correct client’s operation 
completes, regardless of the actions of other clients. A lock-free object guarantees progress: if at some 
point in a run there is an outstanding operation of a correct client, then some operation eventually 
completes. An FW-terminating [1] register is one that has wait-free write operations, and in addition, if
there are finitely many write invocations in a run, then every read operation completes. 

Safety Two runs are equivalent if every client performs the same sequence of operations in both, 
where operations that are outstanding in one can either be included in or excluded from the other. A 
linearization of a run r is an equivalent sequential execution that satisfies r’s operation precedence relation 
and the object's sequential specification. A write $w$ in a run $r$ is relevant to a read $rd$ in $r$ if $rd \not\prec_r w$;
rel-writes($r$, $rd$) is the set of all writes in $r$ that are relevant to $rd$.

Following Lamport [10], we consider a hierarchy of safety notions. Lamport defines regular and
safe single-writer registers. Shao et al. extend Lamport's notion of regularity to MWMR registers,
and give four possible definitions. Here we use two of them. The first is the weakest definition, and we use
it in our lower bound proof. The second, which we use for our algorithm, is the strongest definition that
is satisfied by ABD [1] in case readers do not change the storage (no write-back): A MWMR register is
weakly regular (called MWRegWeak by Shao et al.) if for every run $r$ and read $rd$ that returns in $r$, there exists a
linearization $L_{rd}$ of the subsequence of $r$ consisting of the write operations in $r$ and $rd$. A MWMR register
is strongly regular (called MWRegWO by Shao et al.) if it satisfies weak regularity and the following condition:
For all reads $rd_1$ and $rd_2$ that return in $r$, for all writes $w_1$ and $w_2$ in rel-writes($r$, $rd_1$) $\cap$ rel-writes($r$, $rd_2$),
it holds that $w_1 \prec_{L_{rd_1}} w_2$ if and only if $w_1 \prec_{L_{rd_2}} w_2$.

We extend the safe register definition and say that a MWMR register is strongly safe if, for every run $r$, there exists
a linearization of the subsequence of $r$ consisting of the write operations in $r$, and for every read
operation $rd$ that has no concurrent writes in $r$, it is possible to add $rd$ at some point in this linearization so as to
obtain a linearization of the subsequence of $r$ consisting of the write operations in $r$ and $rd$.

2.3 Erasure codes 

A $k$-of-$n$ erasure code takes a value from domain $V$ and produces a set $S$ of $n$ pieces from some domain
$E$ s.t. the value can be restored from any subset of $S$ that contains no fewer than $k$ different pieces. We
assume that the size of each piece is $D/k$, and two functions encode and decode are given: encode gets a
value $v \in V$ and returns a set of $n$ ordered elements $W = \{(v_1, 1), \ldots, (v_n, n)\}$, where $v_1, \ldots, v_n \in E$, and
decode gets a set $W' \subseteq E \times \mathbb{N}$ and returns $v' \in V$ s.t. if $|W'| \geq k$ and $W' \subseteq W$, then $v = v'$. In this paper
we use $k = n - 2f$. Note that when $k = 1$, we get full replication.
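To make the encode/decode interface concrete, the following is a minimal sketch of a k-of-n code in the style of Reed-Solomon codes: the k symbols of a value are used as polynomial coefficients, and piece i is the polynomial evaluated at i. The field size P = 257, the representation of a value as a list of k field symbols, and the function signatures are assumptions made for illustration only; the paper does not prescribe a particular code. The sketch needs Python 3.8 or later for the modular inverse via pow.

```python
from itertools import combinations

P = 257  # prime field size; an illustrative assumption (requires n < P)

def encode(value, n):
    """Use the k symbols of `value` as polynomial coefficients and
    return n pieces (v_i, i), where v_i is the polynomial evaluated at i."""
    return [(sum(c * pow(i, d, P) for d, c in enumerate(value)) % P, i)
            for i in range(1, n + 1)]

def decode(pieces, k):
    """Recover the k coefficients from any k pieces with distinct indices,
    using Lagrange interpolation modulo P."""
    pts = list(pieces)[:k]
    if len({i for _, i in pts}) < k:
        raise ValueError("need k pieces with distinct indices")
    coeffs = [0] * k
    for y_i, x_i in pts:
        basis, denom = [1], 1
        for _, x_j in pts:
            if x_j == x_i:
                continue
            # Multiply the basis polynomial by (x - x_j).
            nxt = [0] * (len(basis) + 1)
            for d, b in enumerate(basis):
                nxt[d] = (nxt[d] - x_j * b) % P
                nxt[d + 1] = (nxt[d + 1] + b) % P
            basis = nxt
            denom = denom * (x_i - x_j) % P
        scale = y_i * pow(denom, -1, P) % P
        for d, b in enumerate(basis):
            coeffs[d] = (coeffs[d] + scale * b) % P
    return coeffs

f, k = 1, 2
n = 2 * f + k                      # n = 2f + k, i.e. k = n - 2f
pieces = encode([42, 7], n)
# Any k of the n pieces restore the value.
assert all(decode(list(sub), k) == [42, 7] for sub in combinations(pieces, k))
```

With k = 1 the polynomial is constant, so every piece carries the whole value, which matches the remark that k = 1 yields full replication.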


3 A Simple Algorithm 

In order to develop intuition for the structure and limitations of distributed storage algorithms, we
present in Section 3.1 a simple storage-efficient algorithm that ensures safe semantics, but not regularity.
Although this algorithm has no practical use, it shows that the impossibility result of Section 4 does not
apply to a weaker safety property. In Section 3.2, we then illustrate how this simple algorithm can be
extended to ensure regularity using unbounded storage (similarly to some previous works), as proven to
be inherent by our main result in the next section.


3.1 Safe and wait-free algorithm 

This algorithm simulates a wait-free and strongly safe MWMR register using erasure codes. It stores 
exactly $n$ pieces of the data, one in each base object. The algorithm's definitions are presented in
Algorithm 1, and the algorithm of client $c_j$ can be found in Algorithm 2.

We define Timestamps to be the set of timestamps (num, c), s.t. num $\in \mathbb{N}$ and c $\in \Pi$, ordered
lexicographically. We define Pieces to be the set of pairs consisting of an element from $E$ (possible
outputs of the encode function) and a number, and Chunks = Pieces $\times$ Timestamps. Each base object
$bo_i$ stores exactly one value from Chunks, initially $((v_{0_i}, i), (0, 0))$, where $v_{0_i}$ is the $i$-th piece of $v_0$.

Since memory is fault-prone, actions are triggered in parallel on all base objects. This parallelism
is denoted using "$\|$ for" in the code. Operations then wait for $n - f$ base objects to respond. Recall
that $n = 2f + k$, so every two sets of $n - f$ base objects have at least $2(n - f) - n = k$ base objects in common. Thus, if a
write completes after storing pieces on $n - f$ base objects, a subsequent read accessing any $n - f$ base
objects finds $k$ pieces of the written value (as needed for restoring the value), provided that they are not
overwritten by later writes.
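The intersection claim can be checked by brute force for small parameters. The sketch below enumerates all sets of n - f base objects and confirms that the smallest pairwise overlap equals k when n = 2f + k; the function name is hypothetical.

```python
from itertools import combinations

def min_quorum_overlap(n, f):
    """Smallest intersection size over all pairs of sets of n - f base objects."""
    quorums = list(combinations(range(n), n - f))
    return min(len(set(a) & set(b)) for a in quorums for b in quorums)

for f, k in [(1, 1), (1, 2), (2, 1), (2, 3)]:
    n = 2 * f + k
    assert min_quorum_overlap(n, f) == k
```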

A write($v$) operation (lines 1-9) first produces $n$ pieces from $v$ using encode, then reads from $n - f$ base
objects to obtain a new timestamp, and finally, tries to store every piece together with the timestamp at
a different base object. For every base object $bo$, $c_j$ triggers the update RMW function, which overwrites
$bo$ only if $c_j$'s timestamp is bigger than the timestamp stored in $bo$.


A read (lines 13-19) reads the values stored in $n - f$ base objects, and then tries to restore valid data
as follows. If $c_j$ reads at least $k$ values with the same timestamp, it uses the decode function, and returns
the restored value. Otherwise, it returns $v_0$. The latter occurs only if there are outstanding writes that
had updated fewer than $n - f$ base objects before the reader accessed them. Therefore, these writes
are concurrent with $c_j$'s read, and by the safety property, any value can be returned in this case. The
algorithm's correctness is formally proven in Appendix A.1.


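Algorithm 1 Definitions.

1: TimeStamps $= \mathbb{N} \times \Pi$, with selectors num and c, ordered lexicographically.
2: Pieces $= E \times \mathbb{N}$
3: Chunks $=$ Pieces $\times$ TimeStamps, with selectors val, ts
4: encode $: V \rightarrow 2^{Pieces}$, decode $: 2^{Pieces} \rightarrow V$
5:     s.t. $\forall v \in V$, encode($v$) $= \{(*, 1), \ldots, (*, n)\}$ $\wedge$
6:     $\forall W \in 2^{Pieces}$, if $W \subseteq$ encode($v$) $\wedge$ $|W| \geq k$, then decode($W$) $= v$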


Algorithm 2 Safe register emulation. Algorithm for client $c_j$.

1: operation write($v$)
2:     $W \leftarrow$ encode($v$)
3:     $R \leftarrow$ readValue()
4:     $ts \leftarrow (\max(\{ts \mid ((ts, *), *) \in R\}) + 1, j)$
5:     $\|$ for all $(v, i) \in W$
6:         update($bo_i$, $(v, i)$, $ts$)    ▷ trigger RMW on $bo_i$
7:     wait for $n - f$ responses
8:     return "ok"
9: end

10: update($bo$, $w$, $ts$) $\triangleq$
11:     if $ts > bo.ts$
12:         $bo \leftarrow (w, ts)$

13: operation read()
14:     $R \leftarrow$ readValue()
15:     if $\exists ts$ s.t. $|\{v \mid (ts, v) \in R\}| \geq k$
16:         $ts' \leftarrow ts$ s.t. $|\{v \mid (ts, v) \in R\}| \geq k$
17:         return decode($\{v \mid (ts', v) \in R\}$)
18:     return $v_0$
19: end

20: procedure readValue()
21:     $R \leftarrow \{\}$
22:     $\|$ for $i = 1$ to $n$
23:         $R \leftarrow R \cup$ read($bo_i$)
24:     wait until $|R| \geq n - f$
25:     return $R$
26: end procedure
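The behavior of Algorithm 2 can be illustrated with a small single-threaded simulation. This is a sketch under several assumptions that are not part of the paper: asynchrony is modeled by explicitly passing the sets of base objects that respond (read_from, responders) or on which the update RMW takes effect (applied_on), and encode/decode are stand-ins that only enforce the k-piece threshold rather than a real erasure code.

```python
from collections import defaultdict

N, F = 4, 1
K = N - 2 * F          # k = n - 2f
V0 = "v0"              # distinguished initial value

def encode(v):
    # Stand-in for a real k-of-n code: piece i is simply (v, i).
    return [(v, i) for i in range(N)]

def decode(pieces):
    assert len(pieces) >= K, "fewer than k pieces"
    return pieces[0][0]

class BaseObject:
    def __init__(self, i):
        self.piece, self.ts = (V0, i), (0, 0)

    def update(self, piece, ts):
        # RMW: overwrite only if the new timestamp is larger.
        if ts > self.ts:
            self.piece, self.ts = piece, ts

def read_value(objs, responders):
    return [(objs[i].ts, objs[i].piece) for i in responders]

def write(objs, v, client, read_from, applied_on):
    R = read_value(objs, read_from)
    ts = (max(t[0] for t, _ in R) + 1, client)
    for piece in encode(v):
        if piece[1] in applied_on:      # RMWs on other objects stay pending
            objs[piece[1]].update(piece, ts)

def read(objs, responders):
    by_ts = defaultdict(list)
    for ts, piece in read_value(objs, responders):
        by_ts[ts].append(piece)
    for pieces in by_ts.values():
        if len(pieces) >= K:
            return decode(pieces)
    return V0

objs = [BaseObject(i) for i in range(N)]
write(objs, "a", client=1, read_from={0, 1, 2}, applied_on={0, 1, 2})
assert read(objs, {1, 2, 3}) == "a"     # k pieces of "a" are found
write(objs, "b", client=2, read_from={0, 1, 2}, applied_on={1})  # writer stalls
assert read(objs, {1, 2, 3}) == V0      # no k pieces share a timestamp
```

The last read overlaps an outstanding write, so returning the initial value is allowed by safe semantics, although a regular register would not permit it.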


3.2 Achieving regularity with unbounded storage 

We now give intuition why extending this approach to satisfy regularity requires unbounded storage. 
Note that a read from a regular register must return a valid value even if it has concurrent writes, and 
that a write may remain outstanding indefinitely in case the writer fails. 

Consider a system with $n = 4$, $f = 1$, $k = 2$, where $b_1$ is faulty and clients $c_1$ and $c_2$ invoke write($v_1$)
and write($v_2$), respectively, as illustrated in Figure 1(a).

Since base objects may fail, clients $c_1$ and $c_2$ try to store their pieces in all the base objects in parallel
(as in Algorithm 2). Assume that $c_1$'s first RMW on $b_2$ and $c_2$'s RMW on $b_3$ take effect. If these RMWs
were to overwrite the pieces stored in those base objects, and $c_1$ and $c_2$ then immediately failed, the storage would
be left with no restorable value. In this case, no later read can return a value satisfying regularity (note
that since the two outstanding writes are concurrent with any future read, a safe register may return an
arbitrary value). Therefore, $c_1$ and $c_2$ cannot overwrite the existing value in the base objects.

Consider next a client $c_3$ attempting to write $v_3$ as in Figure 1(b). Even if $c_3$ reads the base objects,
it cannot learn of any complete write. Moreover, when its RMW takes effect on $b_4$, it cannot distinguish
between a scenario in which $c_1$ and $c_2$ have failed (thus, their pieces can be overwritten), and a scenario
in which one of $c_1$ and $c_2$ is slow and will eventually be the only client to complete a write (in which
case overwriting its value may leave the storage with no restorable value). Thus, $c_3$ cannot overwrite any
piece.

We can repeat this process by allowing an unbounded number of clients to invoke writes and store 
exactly one piece each, without allowing any piece to be overwritten. While this example only shows 
that a direct extension of Algorithm 2 consumes unbounded storage, in the next section we prove a lower
bound on the storage required by any protocol. 


(a) Clients $c_1$ and $c_2$ invoke writes. (b) Clients $c_1$ and $c_2$ fail, $c_3$ invokes write.

Figure 1: Example scenarios of erasure coded regular storage; $n = 4$, $f = 1$, and $k = 2$. Small boxes
represent pieces of the written value. Complete arrows represent RMWs that took effect, and short arrows
represent pending ones.

4 Storage Lower Bound 

We now show a lower bound on the required storage of any lock-free algorithm that simulates a weakly
regular MWMR register. Our bound stipulates that if the number of clients that can invoke write operations
is unbounded, then either (1) there is a time during which there exist $f + 1$ base objects each of
which stores at least $D$ bits of some write, or (2) the storage can grow without bound.

Information theoretic storage model The storage lower bound presented in this section is obtained
under a precise and natural information theoretic model of storage cost. We model the general
behavior of a base object in a distributed protocol as follows. Upon each RMW operation triggered on
it, the base object implements some function $\mathcal{E}$, whose inputs are the values currently stored in the base
object and the data provided with the write. After the RMW operation, the bits output from $\mathcal{E}$ are
everything that is stored in the base object. Upon a read operation triggered on the base object, the bits
currently stored in it are input to some function $\mathcal{D}_i$, whose output is the value returned to the reader. To
justify this model, let us observe that the role of base objects in the distributed register emulation is to
store sufficient information to guarantee successful information reconstruction by a client following some
future read. In the next lemma we give a more formal definition of the functions $\mathcal{E}$ and $\mathcal{D}_i$, and prove an
elementary lower bound on the number of bits that $\mathcal{E}$ needs to output.

Lemma 1. Let $\mathcal{E}$ be a function on $s$ arguments $u_1, \ldots, u_s$ taking values from sets $U_1, \ldots, U_s$, respectively.
Let the output of $\mathcal{E}$ be a binary vector in $\{0, 1\}^l$. If there exist $s$ functions $\{\mathcal{D}_i\}_{i=1}^{s}$ such that
$\mathcal{D}_i(\mathcal{E}(u_1, \ldots, u_s)) = u_i$ for every assignment to $u_1, \ldots, u_s$, then necessarily $l \geq \lceil \log_2(|U_1| \cdots |U_s|) \rceil$.

Proof. By a simple pigeonhole argument. For simplicity we assume that the sizes $|U_i|$ are powers of
2 for every $i$. Suppose the lemma statement is not true, that is, the output of $\mathcal{E}$ has fewer than
$\log_2(|U_1| \cdots |U_s|)$ bits. Then there exist at least two assignments to $u_1, \ldots, u_s$ that map to the same
output of $\mathcal{E}$. Hence the outputs of the functions $\{\mathcal{D}_i\}_{i=1}^{s}$ will be the same on both assignments, which is
a violation because at least one $u_i$ differs between the two assignments. □
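The pigeonhole argument can be checked on a toy instance. The sketch below computes the bound of Lemma 1 with exact integer arithmetic and tests two hypothetical storage functions for collisions; the helper names and the example functions are illustrative assumptions, not part of the paper. It needs Python 3.8 or later for math.prod.

```python
from itertools import product
from math import prod

def min_bits(set_sizes):
    """ceil(log2(|U_1| * ... * |U_s|)), computed with exact integer arithmetic."""
    return (prod(set_sizes) - 1).bit_length()

def has_collision(store, domains):
    """True if two distinct assignments map to the same stored output."""
    seen = set()
    for assignment in product(*domains):
        out = store(*assignment)
        if out in seen:
            return True
        seen.add(out)
    return False

domains = [range(2), range(4)]                  # |U_1| = 2, |U_2| = 4
assert min_bits([2, 4]) == 3

def three_bits(u1, u2):
    return (u1 << 2) | u2                       # 3 output bits, injective

def two_bits(u1, u2):
    return ((u1 << 2) | u2) & 0b11              # truncated to 2 output bits

assert not has_collision(three_bits, domains)   # every u_i is recoverable
assert has_collision(two_bits, domains)         # some u_i cannot be recovered
```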

We next show how Lemma 1 implies lower bounds on the storage used in base objects. Since the
information reconstruction algorithm is run by the client on inputs from base objects, we may regard
each RMW operation $i$ as requiring the base object to store a value $u_i$ from some set $U_i$. The size of the
set $U_i$ may change arbitrarily between writes and base objects. The particular choices of set sizes are
immaterial for the current discussion, but in general they satisfy the necessary condition that globally
on all surviving base objects the product of set sizes is at least $|V|$. In the next lemma we prove that the
most general function implemented by a base object upon RMW is a function $\mathcal{E}$ as specified in Lemma 1.

Lemma 2. Without loss of generality, a function $\mathcal{E}$ used by a base object is a fixed ("hard coded") function
that does not depend on the instantaneous values $u_1, \ldots, u_s$.

Proof. Suppose the base object has a family of functions $\mathcal{E}^1, \ldots, \mathcal{E}^m$ that each maps values $u_1, \ldots, u_s$ to
bits. Then, in order to allow recovering the $u_i$ values, we must store additional $\log_2(m)$ bits to inform
the functions $\mathcal{D}_i$ about which function $\mathcal{E}^j$ was used. Therefore, this scenario is equivalent to having
$\mathcal{E}(u_1, \ldots, u_s) = [\mathcal{E}^j(u_1, \ldots, u_s); j]$, where ; represents concatenation, and $\mathcal{E}$ is a fixed function. □

Lemma 2 addresses the possibility of base objects reducing the amount of storage by adapting their
functions to the instantaneous stored values. The lemma proves that without prior knowledge of the
written data it is not possible to adaptively reduce the storage requirement mandated by Lemma 1. Now
we are ready to prove the main property needed for our storage model. The next theorem shows that 
each write to a base object must add a number of bits depending on the required set size for that write, 
irrespective of the information presently stored from prior writes. 

Theorem 1. Any write triggered on a base object with value $u_s \in U_s$ adds at least $\log_2(|U_s|)$ bits.

Proof. We prove by induction on $s$. By the induction hypothesis, after $s - 1$ writes the base object
stores $\log_2(|U_1| \cdots |U_{s-1}|)$ bits. Then following write $s$ triggered on the base object, we know from
Lemmas 1 and 2 that any function implemented in the base object that will allow recovering $u_1, \ldots, u_s$ needs
at least $\log_2(|U_1| \cdots |U_s|)$ bits. By simple subtraction we get that the new write adds at least $\log_2(|U_s|)$
bits. □

The outcome of Theorem 1 is that the base object storage cost in bits is obtained as the sum of the
storage requirements of individual writes. Hence in the sequel we can assume without loss of generality 
that each stored bit is associated with a particular write. 

With the storage model in place, we now organize the proof as follows: First, in Observation 2, we
observe a necessary condition for a write operation to complete. Next, we define an (unfair) adversary,
and in Lemma 3 we show that under this adversary's behavior, no write operation can complete as long
as the number of base objects that store at least $D$ bits that are associated with some written value is
at most $f$. Finally, in Lemma 4 and Theorem 2 we show that for every size $S$, for any algorithm that
uses less storage than $S$ and with which the number of base objects that store at least $D$ bits of some
written value is at most $f$ at any given time, we can build a fair run in which no write operation completes.

For any time $t$ in a run $r$ of an algorithm $A$ we define the following sets, as illustrated in Figure 2.

• $C(t)$: the set of all clients that have outstanding write operations at time $t$.

• $C^+(t) \subseteq C(t)$: the set of clients that have outstanding write operations write($v_i$) s.t. at least one
bit associated with $v_i$ is stored in one of the base objects or in one of the other correct clients at
time $t$.

• $C^-(t) = C(t) \setminus C^+(t)$. Clients in $C^-(t)$ may have attempted to store a bit via an RMW that did not
respond, or may have stored information that was subsequently erased, or may have not attempted
to store anything yet.

• $F(t) = \{b_i \in N \mid b_i$ stores $D$ bits of some write at time $t\}$.

From the definition of $C^+(t)$ we get the following:

Observation 1. At any time $t$ in a run $r$, the storage size is at least $|C^+(t)|$ bits.
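The four sets and Observation 1 can be illustrated on a snapshot of the system at a single time t. The sketch below is a simplification under stated assumptions: the snapshot format (a dictionary from base objects to the number of bits stored per writer) is hypothetical, and bits held by other correct clients are ignored.

```python
def classify(outstanding_writers, stored_bits, D):
    """outstanding_writers: clients with an outstanding write at time t.
    stored_bits: dict mapping base object -> {client: bits of that client's
    write stored there}. Returns (C, C_plus, C_minus, F) for the snapshot."""
    C = set(outstanding_writers)
    C_plus = {c for c in C
              if any(bits.get(c, 0) > 0 for bits in stored_bits.values())}
    C_minus = C - C_plus
    F = {b for b, bits in stored_bits.items()
         if any(amount >= D for amount in bits.values())}
    return C, C_plus, C_minus, F

D = 8
snapshot = {"b1": {"c1": 8}, "b2": {"c3": 4}, "b3": {}, "b4": {}}
C, C_plus, C_minus, F = classify(["c1", "c2", "c3", "c4"], snapshot, D)
assert C_plus == {"c1", "c3"} and C_minus == {"c2", "c4"} and F == {"b1"}

# Observation 1: every client in C_plus accounts for at least one stored bit.
total_bits = sum(sum(bits.values()) for bits in snapshot.values())
assert total_bits >= len(C_plus)
```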

Observation 2. Consider a run $r$ of an algorithm that simulates a weakly regular lock-free MWMR
register, and a write operation $w$ in $r$. Operation $w$ cannot return until there is a time $t$ s.t. for every
$B \subseteq N$ s.t. $|B| = n - f$, there is some client in $C(t)$ whose pending write's value can be restored from $B$.

(b) Time $t + 1$

Figure 2: Example run of a storage algorithm. Clients $c_1, \ldots, c_4$ have outstanding writes.


Proof. Assume that some write completes at a time $t$ when there is a set $B \subseteq N$ s.t. $|B| = n - f$ and there is no
client in $C(t)$ whose write's value can be restored from $B$. Now, let all the base objects in $N \setminus B$ and
all the clients in $C(t)$ fail, and invoke a read operation $rd$. By lock-freedom, $rd$ completes, although no
value satisfying weak regularity can be returned. A contradiction. □

For our lower bound, we define a particular environment behavior that schedules actions in a way that 
prevents progress: 

Definition 1. (Ad) At any time $t$, Ad schedules an action as follows:

1. If there is a pending RMW on a base object in $N \setminus F(t)$ by a client in $C^-(t)$, then choose the longest
pending of these RMWs, allow it to take effect on the corresponding base object, and schedule its
response.

2. Else, choose in round robin order a client $c_i \in \Pi$ that wants to trigger an RMW, and schedule $c_i$'s
action without allowing it to affect the base object yet.
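The two scheduling rules can be sketched as a single step function. The representation below is an assumption made for illustration: pending RMWs are tuples (trigger_time, client, base_object), the sets C^-(t) and F(t) are supplied by the caller, and every client is assumed to want to trigger an RMW when chosen in round robin.

```python
from itertools import cycle

def ad_step(pending, C_minus, F, next_client):
    """One scheduling decision of the adversary.
    Returns ("respond", rmw) under rule 1, or ("trigger", client) under rule 2."""
    eligible = [p for p in pending if p[1] in C_minus and p[2] not in F]
    if eligible:
        rmw = min(eligible)        # longest pending = earliest trigger time
        pending.remove(rmw)
        return ("respond", rmw)
    return ("trigger", next_client())

round_robin = cycle(["c1", "c2", "c3", "c4"])
pick = lambda: next(round_robin)
pending = [(1, "c2", "b1"), (2, "c2", "b3"), (3, "c3", "b2")]

# c2 is in C^-, b1 is in F: only c2's RMW on b3 is eligible (rule 1).
assert ad_step(pending, {"c2", "c4"}, {"b1"}, pick) == ("respond", (2, "c2", "b3"))
# Assumed next state: c2 has joined C^+, c3 has moved to C^-.
assert ad_step(pending, {"c3", "c4"}, {"b1"}, pick) == ("respond", (3, "c3", "b2"))
# No eligible pending RMW remains, so rule 2 picks a client in round robin.
assert ad_step(pending, {"c4"}, {"b1"}, pick) == ("trigger", "c1")
```

The RMW of c2 on b1 is never answered in this trace, which is how the adversary withholds responses on base objects in F(t).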

In other words, Ad delays RMWs triggered by clients in $C^+(t)$ as well as RMWs on base objects in $F(t)$,
and fairly schedules all other actions. Thus, though this behavior may be unfair, in every infinite run
of Ad, every correct client gets infinitely many opportunities to trigger RMWs. We demonstrate Ad's
behavior in Figure 2. (a) Clients $c_2$ and $c_4$ are in $C^-(t)$ at time $t$, where $c_4$ has no pending RMWs and
$c_2$ has one triggered RMW on $b_1 \in F(t)$ and one triggered RMW on $b_3 \notin F(t)$. Therefore, by the first
rule, Ad schedules the response on the RMW triggered by $c_2$ on $b_3$. (b) In this example $c_2$ overwrites $b_3$
and so $c_3$ moves from $C^+$ to $C^-$. Since $c_3$ is the only client that has a pending RMW on a base object
not in $F(t + 1)$, Ad schedules the response on the RMW triggered by $c_3$ on $b_2$ at time $t + 1$. Now notice
that at time $t + 2$ there is no client in $C^-(t + 2)$ with a pending RMW on a base object in $N \setminus F(t + 2)$,
and thus, by the second rule, Ad chooses in round robin a client in $\Pi$ and allows it to trigger an RMW.

The following observation immediately follows from the adversary’s behavior. 

Observation 3. Assume an infinite run $r$ in which the environment behaves like Ad. For each base
object $bo$, if $bo \in F(t)$ at some time $t$, then $bo \in F(t')$ for all $t' > t$.

Another consequence of Ad’s behavior is captured by the following: 

Lemma 3. As long as the environment behaves like Ad, for any time $t$ when $|F(t)| \leq f$, there is a set
$B$ of $n - f$ base objects s.t. there is no client in $C(t)$ whose value can be restored from $B$ at time $t$.

Proof. As soon as a client $c_i$ stores a piece of data in a base object, $c_i$ joins $C^+$, and from that point
on, as long as its data remains in the system, $c_i$ is prevented by Ad from storing any further values.
Therefore, unless $c_i$ stores all $D$ bits of its value in some base object, it is impossible to reconstruct this
value from the bits that were stored. Since the number of base objects storing all $D$ bits of some client
value at some time $t$ is no more than $|F(t)|$, and since $|F(t)| \leq f$, we can take $B$ to be any $n - f$ base
objects outside $F(t)$, and the lemma follows.

□ 


From Observation 2 and Lemma 3 we conclude:

Corollary 1. Consider a run $r$ of an algorithm that simulates a weakly regular lock-free MWMR register.
If the adversary behaves like Ad, and $|F(t)| \leq f$ for all $t$ in $r$, then no write completes in $r$.

Having shown that adversary Ad can prevent progress in algorithms that store D bits of information 
in too few base objects, we turn to show that we can prevent progress also in fair runs, leading to violation 
of lock-freedom. 

Lemma 4. Consider a finite run $r$ with $t$ steps of an algorithm that simulates a lock-free MWMR register,
where the environment behaves like adversary Ad. If $C^-(t) \neq \{\}$ and $|F(t)| \leq f$, then it is possible to
extend $r$ by allowing the environment to continue to behave like Ad up to a time $t' > t$ when either
$|F(t')| > f$ or some client $c_i \in C(t')$ either returns (i.e., completes the write) or receives a response from
some base object.

Proof. Consider a client $c_i \in C^-(t)$, and denote by $T_{c_i}(t)$ the set of base objects on which $c_i$ has pending
RMWs at time $t$. We first show that if $c_i$ neither receives a response from any base object nor returns,
we can extend $r$ to some time $t''$ s.t. $|T_{c_i}(t'')| > f$.

We extend $r$ by allowing the environment to continue to behave like Ad until the first time $t'$ in which
$c_i$ is the next client chosen by the adversary to trigger an RMW. If $c_i$ receives a response from some base
object by time $t'$, we are done. Else, by definition of Ad, $T_{c_i}(t') \subseteq F(t')$. Now consider a fair run $r'$ that
is identical to $r$ till time $t'$, and at time $t'$ all the clients except $c_i$ fail. Notice that $|T_{c_i}(t')| \leq |F(t')| \leq f$,
so $c_i$ cannot wait for responses from base objects in $T_{c_i}(t')$, and therefore, by lock-freedom, $c_i$ either
returns, or triggers an RMW on some base object in $N \setminus T_{c_i}(t')$ at time $t'$ in $r'$. The runs $r$ and $r'$ are
indistinguishable to $c_i$, hence, $c_i$ either returns or triggers an RMW on some base object in $N \setminus T_{c_i}(t')$ at
time $t'$ in $r$. If $c_i$ returns we are done.

We repeat this extension several times until (after at most $f + 1$ times), at some time $t''$, $|T_{c_i}(t'')| > f$.
If $|F(t'')| > f$, we are done. Otherwise, $T_{c_i}(t'') \not\subseteq F(t'')$, and therefore, Ad schedules a response to one of
the pending RMWs of $c_i$ at time $t''$.

□ 

Theorem 2. For any $S$, there is no algorithm that simulates a weakly regular lock-free MWMR register
with less storage than $S$ s.t. at every time $t$, $|F(t)| \leq f$.

Proof. Assume by way of contradiction that there is such an algorithm, $A$. We build a run of $A$ in which
the environment behaves like adversary Ad.

We iteratively build a run $r$ with infinitely many responses, starting by invoking $S$ write operations
and allowing the run to proceed according to Ad until some time $t$. By the assumption, the storage is
less than $S$, so by Observation 1, $|C^+(t)| < S$, and since $|C(t)| = S$, $C^-(t) \neq \{\}$. Now since $|F(t)| \leq f$,
by Lemma 4 we can extend $r$ to a time $t'$, where the environment behaves like Ad until time $t'$ and some
client $c_i \in C^-(t')$ either returns or receives a response from some base object at time $t'$. By Corollary 1,
$c_i$ does not return, and thus, it receives a response.

By repeating this process, we get a run $r$ with infinitely many responses. By Observation 3 and by
the assumption that $|F(t)| \leq f$, there is a time $t_1$ in $r$ s.t. for any time $t_2 > t_1$, $F(t_1) = F(t_2)$. Notice
that by the adversary's behavior, each correct client gets infinitely many opportunities to trigger RMWs.
In addition, since Ad picks responses from base objects not in $F(t)$ in the order they are triggered,
every client that receives infinitely many responses receives a response to every RMW it triggers on a
base object in $N \setminus F(t_1)$. Therefore, we can build a fair run $r'$ that is identical to $r$ but every base
object $bo \in F(t_1)$ fails at time $t_1$, and every client that receives finitely many responses fails after its last
response. Since there are infinitely many responses in $r'$ and the number of clients invoking operations
in this run is finite, there is at least one client that receives infinitely many responses in $r'$, and thus is
correct in $r'$. Therefore, by lock-freedom, some client eventually completes its write operation in $r'$. Since
$r$ and $r'$ are indistinguishable to all clients and base objects that are correct in both, the same is true in
$r$. A contradiction to Corollary 1.

□ 

From Theorem 2 it follows that if the storage is bounded, then there is a time in which $f + 1$ base objects
store $D$ bits of some write. This yields the following bound:

Corollary 2. There is no algorithm that simulates a weakly regular lock-free MWMR register and stores
less than $(f + 1)D$ bits in the worst case.


5 Strongly Regular MWMR Register Emulation 

We present a storage algorithm that combines full replication with erasure coding in order to achieve 
the advantages of both. The main idea behind our algorithm is to have base objects store pieces from 
at most $k$ different writes, and then turn to store full replicas. In Appendix A.2 we prove the following
about our algorithm:


Theorem 3. There is an FW-terminating algorithm that simulates a strongly regular register, whose
storage is bounded by $(2f + k)2D$ bits, and in runs with at most $c < k$ concurrent writes, the storage
is bounded by $(c + 1)D/k$ bits per base object, that is, $(2f + k)(c + 1)D/k$ bits overall (Lemma 12).
Moreover, in a run with a finite number of writes, if all the writers are
correct, the storage is eventually reduced to $(2f + k)D/k$ bits.
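To make the quantities in the theorem concrete, the following minimal sketch evaluates them for given parameters. The function and key names are illustrative assumptions, not part of the paper; the formulas are those of Corollary 2, Theorem 3, and Lemma 12, with $n = 2f + k$ base objects and pieces of $D/k$ bits.

```python
from fractions import Fraction

def storage_bounds(f: int, k: int, c: int, D: int) -> dict:
    """Storage figures in bits for n = 2f + k base objects and D-bit values.

    Assumes c < k concurrent writes, as in the adaptive part of the theorem.
    """
    n = 2 * f + k
    piece = Fraction(D, k)                        # size of one coded piece
    return {
        "lower_bound": (f + 1) * D,               # Corollary 2
        "worst_case": n * 2 * D,                  # (2f + k) * 2D
        "adaptive_per_object": (c + 1) * piece,   # at most c + 1 pieces per object
        "adaptive_total": n * (c + 1) * piece,    # Lemma 12
        "quiescent": n * piece,                   # (2f + k) * D / k
    }

b = storage_bounds(f=2, k=4, c=1, D=1024)
for name, bits in b.items():
    print(f"{name:20s} {bits}")
```

With $f = 2$ and $k = 4$, the quiescent cost drops below the $(f + 1)D$ lower bound because no write is outstanding, while the worst case exceeds it, which matches the adaptive behavior described in the text.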

Data structure. The algorithm uses the same definitions as the safe algorithm presented earlier,
and its pseudocode appears in Algorithms 3 and 4. The algorithm relies on a set of $n$ shared
base objects $bo_1, \ldots, bo_n$, each of which consists of three fields $V_p$, $V_f$, and $storedTS$:


$bo_i = \langle storedTS, V_p, V_f \rangle$ s.t. $V_f, V_p \subseteq Chunks$ and $storedTS \in TimeStamps$,
initially $\langle (0,0), \{\langle (0,0), \langle v_{0_i}, i \rangle\rangle\}, \{\}\rangle$.


The $V_p$ field holds a set of timestamped coded pieces of values so that the $i$-th piece of any value
can only be stored in the $V_p$ field of object $bo_i$. The $V_f$ field stores a timestamped replica of a single
value (which for simplicity is represented as a set of $k$ coded pieces). Finally, $storedTS$ holds the highest
timestamp of a write that is known to this object to have completed the update round on $n - f$ base
objects (see below).
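The three fields can be pictured with a small record type. This is only a sketch: the type aliases, the field names `stored_ts`, `vp`, `vf`, and the helper `initial_object` are assumptions chosen for illustration, and timestamps are modeled as `(num, client id)` tuples compared lexicographically.

```python
from dataclasses import dataclass, field

Timestamp = tuple[int, int]     # (num, client id)
Piece = tuple[bytes, int]       # (coded data e, piece index i)
Chunk = tuple[Timestamp, Piece] # a timestamped coded piece

@dataclass
class BaseObject:
    stored_ts: Timestamp = (0, 0)                 # highest write known to have finished its update round
    vp: set[Chunk] = field(default_factory=set)   # pieces with index i only, from different writes
    vf: set[Chunk] = field(default_factory=set)   # k pieces of a single value (a full replica), or empty

def initial_object(i: int, v0_piece: bytes) -> BaseObject:
    """State of bo_i before any write: it holds the i-th piece of the initial value."""
    return BaseObject(stored_ts=(0, 0), vp={((0, 0), (v0_piece, i))}, vf=set())

bo3 = initial_object(3, b"\x00")
print(bo3)
```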

Write operation and storage efficiency. The write operation (lines 3-15) consists of three sequentially
executed rounds: read timestamp, update, and garbage collection; the read consists of one or
more sequentially executed read rounds. In each round, the client invokes RMWs on all base objects
in parallel and awaits responses from at least $n - f$ base objects. The read rounds of both write and
read rely on the readValue routine (lines 23-31) to collect the contents of the $V_p$ and $V_f$ fields stored at
$n - f$ base objects, as well as to determine the highest $storedTS$ timestamp known to these objects. The
implementations of the update and garbage collection rounds are given by the update (lines 32-39) and
GC (lines 40-45) routines, respectively.


The write implementation starts by breaking the supplied value $v$ into $k$ erasure-coded pieces (line
4). This is followed by invoking the read round, where the client uses the combined contents of the $V_p$,
$V_f$, and $storedTS$ fields returned by readValue to determine the timestamp $ts$ to be stored alongside $v$ on
the base object. This timestamp is set to be higher than any other timestamp that has been returned
(line 7), thus ensuring that the order of the timestamps associated with the stored values is compatible
with the order of their corresponding writes (which is essential for regularity).
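The timestamp choice of lines 6-7 can be sketched as follows. The function name and the tuple representation of timestamps are assumptions; the rule itself (one more than the highest number seen, paired with the writer's id) is taken from the pseudocode.

```python
def next_timestamp(stored_ts, read_set, client_id):
    """Pick a timestamp higher than every timestamp returned by the read round.

    stored_ts: (num, id) returned by readValue.
    read_set:  set of (timestamp, piece) pairs returned by readValue.
    """
    highest_num = max([stored_ts[0]] + [ts[0] for ts, _piece in read_set])
    return (highest_num + 1, client_id)

read_set = {((3, 1), ("e1", 1)), ((5, 2), ("e2", 2))}
ts = next_timestamp(stored_ts=(4, 7), read_set=read_set, client_id=9)
print(ts)
```

Because Python compares tuples lexicographically, the resulting timestamp is larger than both the returned `stored_ts` and every timestamp in `read_set`; the client id only breaks ties between concurrent writers that pick the same number.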

The client then proceeds to the update round, where it attempts to store the coded piece $\langle e, i \rangle$ of
$v$ in $bo_i.V_p$ if the size of $bo_i.V_p$ is less than $k$ (line 36), or its full replica in $bo_i.V_f$ if $ts$ is higher than the
timestamp associated with the value currently stored in $bo_i.V_f$ (line 38). Note that storing $\langle e, i \rangle$ in $bo_i.V_p$
coincides with an attempt to reduce its size by removing stale coded pieces of values whose timestamps
are smaller than $storedTS$ (line 36). This guarantees that the size of $V_p$ never exceeds the number $c < k$
of concurrent writes, which is key to achieving our adaptive storage bound. Lastly, the client updates
$bo_i.storedTS$ so that its new value is at least as high as the one returned by the readValue routine. This
allows the timestamp associated with the latest complete update to propagate to the base object being
written, in order to prevent future writes of old pieces into this base object.
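The update step described above can be sketched as a single atomic function on one base object. This is an illustrative transcription of lines 32-39 of Algorithm 4 under assumed Python names (`BaseObject`, `update`, `pieces`); pieces are modeled as `(e, j)` pairs and timestamps as tuples.

```python
from dataclasses import dataclass, field

@dataclass
class BaseObject:
    stored_ts: tuple = (0, 0)
    vp: set = field(default_factory=set)   # timestamped coded pieces (ts, (e, i))
    vf: set = field(default_factory=set)   # k pieces of one value, or empty

def update(bo, write_set, ts, stored_ts, i, k):
    """One update RMW on bo_i (lines 32-39)."""
    if ts < bo.stored_ts:                  # a newer write already completed: ignore
        return
    if len(bo.vp) < k:
        # remove pieces older than the latest known complete write, then add piece i
        fresh = {(t, p) for (t, p) in bo.vp if t >= stored_ts}
        mine = {(ts, (e, j)) for (e, j) in write_set if j == i}
        bo.vp = fresh | mine
    elif not bo.vf or any(t < ts for (t, _p) in bo.vf):
        # V_p is full: store a full replica (pieces 1..k) in V_f instead
        bo.vf = {(ts, (e, j)) for (e, j) in write_set if 1 <= j <= k}
    bo.stored_ts = max(bo.stored_ts, stored_ts)

def pieces(tag, n=4):
    """Hypothetical stand-in for encode(v): n labeled pieces of one value."""
    return {(f"{tag}{j}", j) for j in range(1, n + 1)}

k = 2
bo1 = BaseObject(vp={((0, 0), ("v0", 1))})
update(bo1, pieces("a"), ts=(1, 5), stored_ts=(0, 0), i=1, k=k)  # piece lands in V_p
update(bo1, pieces("b"), ts=(2, 6), stored_ts=(0, 0), i=1, k=k)  # V_p full, replica lands in V_f
print(bo1)
```

In the example, the second write finds $V_p$ already holding $k = 2$ pieces, so it falls back to storing a full replica, which is the switch from coding to replication that the text describes.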

In the write's garbage collection round, the client attempts to further reduce the storage usage by (1)
removing all coded pieces associated with timestamps lower than $ts$ from both $bo_i.V_p$ and $bo_i.V_f$ (lines
41-42), and (2) replacing a full replica (if it exists) of its written value $v$ in $bo_i.V_f$ with its coded piece
$\langle e, i \rangle$ (line 44). It is safe to remove the full replica and values with older timestamps at this point, since
once the update round has completed, it is ensured that the written value or a newer written value is
restorable from any $n - f$ base objects. This mechanism ensures that all coded pieces except the ones
comprising the value written with the highest timestamp are eventually removed from all objects' $V_p$ and
$V_f$ sets, which reduces the storage to a minimum in runs with finitely many writes, all of which complete.
The garbage collection round also updates the $bo_i.storedTS$ field to ensure its value is at least as high as
$ts$, reflecting the fact that a write with timestamp $ts' \ge ts$ has completed the update round.
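The garbage collection step can be sketched in the same style. The code below is an assumed transcription of lines 40-45 of Algorithm 4; the name `gc` and the use of `SimpleNamespace` as a stand-in base object are illustrative choices.

```python
from types import SimpleNamespace

def gc(bo, write_set, ts, i):
    """One garbage-collection RMW on bo_i (lines 40-45)."""
    bo.vp = {(t, p) for (t, p) in bo.vp if t >= ts}   # keep only new pieces
    bo.vf = {(t, p) for (t, p) in bo.vf if t >= ts}
    if any(t == ts for (t, _p) in bo.vf):             # V_f holds a full replica of my write
        bo.vf = {(ts, (e, j)) for (e, j) in write_set if j == i}  # keep only piece i
    bo.stored_ts = max(bo.stored_ts, ts)

write_set_b = {("b1", 1), ("b2", 2), ("b3", 3), ("b4", 4)}
bo1 = SimpleNamespace(
    stored_ts=(0, 0),
    vp={((0, 0), ("v0", 1)), ((1, 5), ("a1", 1))},
    vf={((2, 6), ("b1", 1)), ((2, 6), ("b2", 2))},
)
gc(bo1, write_set_b, ts=(2, 6), i=1)
print(bo1)
```

After the call, the object holds a single piece of the newest write, which is the quiescent state of $D/k$ bits per base object.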

Key invariant and read operation. The write implementation described above guarantees the
following key invariant: at all times, a value written by either the latest complete write or a newer write
is available from every set consisting of at least $n - f$ base objects (either in the form of $k$ coded pieces
in the objects' $V_p$ fields, or in full from one of their $V_f$ fields). Therefore, a read will always be able to
reconstruct the latest completely written value or a newer one, provided it can successfully retrieve $k$ matching
pieces of this value. However, a read round may sample different base objects at different times (that is,
it does not necessarily obtain a snapshot of all base objects), and the number of pieces stored in $V_p$ is
bounded. Thus, the read may be unable to see $k$ matching pieces of any single new value for indefinitely
long, as long as new values continue to be written concurrently with the read.
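The property that any $k$ matching pieces suffice to rebuild a value can be illustrated with a toy erasure code. The paper does not fix a particular code; the sketch below assumes a Reed-Solomon-style construction by polynomial interpolation over the prime field of size 257, and the names `lagrange_eval`, `encode`, and `decode` are illustrative.

```python
from itertools import combinations

P = 257  # assumed toy field size: a prime larger than any byte value

def lagrange_eval(points, x):
    """Evaluate at x the unique polynomial of degree < len(points) through points, mod P."""
    total = 0
    for xj, yj in points:
        num, den = 1, 1
        for xm, _ in points:
            if xm != xj:
                num = num * (x - xm) % P
                den = den * (xj - xm) % P
        total = (total + yj * num * pow(den, -1, P)) % P
    return total

def encode(data, n):
    """data: k symbols in [0, P). Returns n pieces (e, i), i = 1..n."""
    points = list(enumerate(data, start=1))   # data symbol j is the value at x = j
    return [(lagrange_eval(points, i), i) for i in range(1, n + 1)]

def decode(pieces, k):
    """Rebuild the k data symbols from any k pieces with distinct indices."""
    points = [(i, e) for e, i in list(pieces)[:k]]
    return [lagrange_eval(points, x) for x in range(1, k + 1)]

data = [72, 105]                  # k = 2 symbols
coded = encode(data, n=4)         # n = 2f + k with f = 1
recovered = [decode(pair, k=2) for pair in combinations(coded, 2)]
print(coded, recovered)
```

Each piece carries one symbol while the value carries $k$ symbols, so a piece is $1/k$ of the data size, and every one of the six pairs of pieces decodes to the original value.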

To cope with such situations, the reads are only required to return in runs where a finite number of
writes are invoked, thus only guaranteeing FW-Termination. Our implementation of read (lines 16-22)
proceeds by invoking multiple consecutive rounds of RMWs on the base objects via the readValue routine.
After each round, the reader examines the collection of the values and timestamps returned by the base
objects to determine if any of the values having $k$ matching coded pieces are associated with timestamps
that are at least as high as $storedTS$ (line 18). If any such value is found, the one associated with the
highest timestamp is returned (line 21). Otherwise, the reader proceeds to invoke another round of base
object accesses. Note that returning values associated with older timestamps may violate regularity, since
they may have been written earlier than the write with timestamp $storedTS$, which in turn may have
completed before the read was invoked.

Algorithm 3 Strongly regular register emulation. Algorithm for client $c_j$.


 1: local variables:
 2:   storedTS, ts ∈ TimeStamps, WriteSet ∈ Pieces
 3: operation Write(v)
 4:   WriteSet ← encode(v)
 5:   ⟨storedTS, ReadSet⟩ ← readValue()                       ▷ round 1: read timestamps
 6:   n ← max(storedTS.num, max{n' | ⟨⟨n', *⟩, *⟩ ∈ ReadSet})
 7:   ts ← ⟨n + 1, j⟩
 8:   || for i = 1 to n                                        ▷ round 2: update
 9:     update(bo_i, WriteSet, ts, storedTS, i)
10:   wait for n - f responses
11:   || for i = 1 to n                                        ▷ round 3: garbage collect
12:     GC(bo_i, WriteSet, ts, i)
13:   wait for n - f responses
14:   return "ok"
15: end
16: operation Read()
17:   ⟨storedTS, ReadSet⟩ ← readValue()
18:   while ∄ ts ≥ storedTS s.t. |{⟨ts, v⟩ | ⟨ts, v⟩ ∈ ReadSet}| ≥ k
19:     ⟨storedTS, ReadSet⟩ ← readValue()
20:   ts' ← max{ts : |{⟨ts, v⟩ | ⟨ts, v⟩ ∈ ReadSet}| ≥ k}
21:   return decode({v | ⟨ts', v⟩ ∈ ReadSet})
22: end
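The test performed by the read loop (lines 18-21) can be sketched as a pure function over the result of one readValue round. The name `select_value` and the `None` return for "run another round" are assumptions made for illustration.

```python
from collections import defaultdict

def select_value(stored_ts, read_set, k):
    """Return (ts, pieces) for the highest ts >= stored_ts that has at least k
    distinct pieces in read_set, or None if the reader must run another round."""
    by_ts = defaultdict(set)
    for ts, piece in read_set:
        by_ts[ts].add(piece)
    candidates = [ts for ts, ps in by_ts.items()
                  if ts >= stored_ts and len(ps) >= k]
    if not candidates:
        return None
    best = max(candidates)
    return best, by_ts[best]

read_set = {((1, 1), ("a1", 1)), ((1, 1), ("a2", 2)), ((2, 3), ("b1", 1))}
hit = select_value(stored_ts=(1, 1), read_set=read_set, k=2)
miss = select_value(stored_ts=(2, 3), read_set=read_set, k=2)
print(hit, miss)
```

In the second call the only decodable value is older than `stored_ts`, so returning it could violate regularity; the sketch reports `None`, corresponding to another iteration of the while loop.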



Algorithm 4 Functions used in strongly regular register emulation. 

23: procedure readValue()
24:   ReadSet ← {}, T ← {}
25:   || for i = 1 to n
26:     tmp ← read(bo_i)
27:     ReadSet ← ReadSet ∪ tmp.Vf ∪ tmp.Vp
28:     T ← T ∪ {tmp.storedTS}
29:   wait for n - f responses
30:   return ⟨max(T), ReadSet⟩
31: end procedure
32: update(bo, WriteSet, ts, storedTS, i) =
33:   if ts < bo.storedTS
34:     return
35:   if |bo.Vp| < k                                           ▷ write a piece and remove old pieces
36:     bo.Vp ← (bo.Vp \ {⟨ts', v⟩ ∈ bo.Vp | ts' < storedTS}) ∪ {⟨ts, ⟨e, i⟩⟩ | ⟨e, i⟩ ∈ WriteSet}
37:   else if bo.Vf = {} ∨ ∃ ts' < ts : ⟨ts', *⟩ ∈ bo.Vf       ▷ write a full replica
38:     bo.Vf ← {⟨ts, ⟨e, j⟩⟩ | ⟨e, j⟩ ∈ WriteSet ∧ j ∈ {1, ..., k}}
39:   bo.storedTS ← max(bo.storedTS, storedTS)
40: GC(bo, WriteSet, ts, i) =
41:   bo.Vp ← {⟨ts', v⟩ ∈ bo.Vp | ts' ≥ ts}                    ▷ keep only new pieces
42:   bo.Vf ← {⟨ts', v⟩ ∈ bo.Vf | ts' ≥ ts}
43:   if ⟨ts, *⟩ ∈ bo.Vf                                       ▷ if Vf holds a full replica of my write
44:     bo.Vf ← {⟨ts, ⟨e, i⟩⟩ | ⟨e, i⟩ ∈ WriteSet}             ▷ keep only one piece of it
45:   bo.storedTS ← max(bo.storedTS, ts)



6 Discussion 

We studied the storage cost of shared register simulations in asynchronous fault-prone shared memory.
We proved a lower bound on the required storage of any lock-free algorithm that simulates a weakly
regular MWMR register. Our bound stipulates that if write concurrency is unbounded, then either (1)
there is a time during which there exist $f + 1$ base objects each of which stores a full replica of some
written value, or (2) the storage can grow without bound.

We showed that our lower bound does not hold for safe register emulation. Finally, by understanding
these inherent limitations, we introduced a new technique for emulating shared storage by combining
full replication with erasure codes. We presented an implementation of an FW-Terminating strongly
regular MWMR register, whose storage cost is adaptive to the concurrency level of write operations up
to a certain point, and then turns to store full replicas. In periods during which there are no outstanding
writes, our algorithm's storage cost is reduced to a minimum.

Our work leaves some questions open for future work. First, we conjecture that a wait-free implementation
with similar storage costs requires readers to write. Second, our algorithm requires more storage
than the bound. We believe that our technique can be used for implementing additional adaptive algorithms,
with storage costs closer to the lower bound. Another interesting question that remains open is
whether the liveness condition of the lower bound is tight. In other words, is there an algorithm that
emulates an obstruction-free weakly regular register with a better storage cost?

A Correctness Proofs


A.1 Wait-Free and Safe Algorithm

Here we prove the correctness of the wait-free safe algorithm.

Lemma 5. The storage of the algorithm is nD/k. 

Proof. The size of each piece is D/k. We have n base objects, and each base object stores exactly one 
piece. 

□ 


Lemma 6. The algorithm is wait-free. 

Proof. There are no loops in the algorithm, and the only blocking instructions are its two wait statements
(the second in line 24). In both cases, clients wait for no more than $n - f$ responses, and since no more than $f$ base objects
can fail, clients eventually continue. Therefore, a client that gets the opportunity to perform infinitely
many actions completes its operations.

□ 


We now prove that the algorithm satisfies strong safety. We rely on the following observation.

Observation 4. The timestamps in the base objects are monotonically increasing.

Definition 2. For every run $r$, we define the sequential run $\sigma_r$ as follows: all the completed write
operations in $r$ are ordered in $\sigma_r$ by their timestamp.

Lemma 7. For every run $r$, the sequential run $\sigma_r$ is a linearization of $r$.

Proof. Since $\sigma_r$ has no read operations, the sequential specification is preserved in $\sigma_r$. Thus, it remains to
show that the real-time order is preserved: for every two completed writes $w_i, w_j$ in $r$, we need to show that if $w_i \prec_r w_j$,
then $w_i \prec_{\sigma_r} w_j$.

Denote $w_i$'s timestamp by $ts$. By Observation 4, at any point after $w_i$'s return, at least $n - f$ base
objects store timestamps bigger than or equal to $ts$. When $w_j$ picks a timestamp, it chooses a timestamp
bigger than those it reads from $n - f$ base objects. Since $n > 2f$, $w_j$ picks a timestamp bigger than $ts$,
and therefore $w_j$ is ordered after $w_i$ in $\sigma_r$.

□ 


Definition 3. For every run $r$, for every read $rd$ that has no concurrent write operations in $r$, we define
the sequential run $\sigma_{rd}$ by adding $rd$ to $\sigma_r$ after all the writes that precede it in $r$.

In order to show that the algorithm simulates a safe register, we prove in Lemmas 8 and 9 that the
real-time order and the sequential specification, respectively, are preserved in $\sigma_{rd}$.

Lemma 8. For every run $r$, for every read $rd$ that has no concurrent write operations in $r$, $\sigma_{rd}$ preserves
$r$'s operation precedence relation (real-time order).

Proof. By Lemma 7, the order between the writes in $\sigma_{rd}$ is preserved, and by the construction of $\sigma_{rd}$, the
order between $rd$ and the write operations is also preserved.

□ 


Lemma 9. Consider a run $r$ and any read $rd$ that has no concurrent writes in $r$. Then $rd$ returns the
value written by the write with the biggest timestamp that precedes $rd$ in $r$, or $v_0$ if there is no such write.

Proof. In case there is no write before $rd$ in $r$, since there are also no writes concurrent with $rd$, $rd$ reads
pieces with timestamp $(0, 0)$ from all base objects, and thus returns $v_0$. Otherwise, let $w$ be the write($v$)
associated with the biggest timestamp $ts$ among all the writes invoked before $rd$ in $r$. Let $t$ be the time
when $rd$ is invoked. Recall that $rd$ has no concurrent writes, so all the writes invoked before time $t$
complete before time $t$ and store their pieces in $n - f$ base objects, unless the base objects already hold
a higher timestamp. By Observation 4 and the fact that $w$ has the highest timestamp by time $t$, we get
that at time $t$ there are at least $n - f$ base objects that store a piece of $v$. Since $n = 2f + k$, every two
sets of $n - f$ base objects have at least $k$ base objects in common. Therefore, $rd$ reads at least $k$ pieces
of $v$, and thus restores and returns $v$.

□ 

Corollary 3. There exists an algorithm that simulates a safe wait-free MWMR register with a worst-case
storage cost of $nD/k = (2f/k + 1)D$.
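For reference, the two counting facts used above can be written out explicitly. For any two sets $A$, $B$ of $n - f$ base objects with $n = 2f + k$, inclusion-exclusion gives the intersection size used in the proof of Lemma 9, and substituting $n$ gives the storage cost of Corollary 3:

```latex
|A \cap B| \;=\; |A| + |B| - |A \cup B| \;\ge\; 2(n - f) - n \;=\; n - 2f \;=\; k,
\qquad
\frac{nD}{k} \;=\; \frac{(2f + k)\,D}{k} \;=\; \left(\frac{2f}{k} + 1\right) D .
```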

A.2 Strongly Regular Algorithm 

Here we prove the correctness of the algorithm of Section 5. We start by proving the storage cost.

Observation 5. For every run of the algorithm, for every base object $bo_i$, $bo_i.ts$ (the $storedTS$ field of $bo_i$) is monotonically increasing.

Lemma 10. Consider a run $r$ of the algorithm, and two writes $w_1, w_2$, where $w_1$ writes with timestamp
$ts_1$. If $w_1 \prec_r w_2$, then $w_2$ sets its $storedTS$ to a timestamp that is not smaller than $ts_1$.

Proof. By Observation 5, for each base object $bo$, $bo.ts$ is monotonically increasing. Therefore, after $w_1$
finishes the garbage collection phase, there is a set $S$ consisting of $n - f$ base objects s.t. for each $bo_i \in S$,
$bo_i.ts \ge ts_1$. Recall that $n = 2f + k$, thus every two sets of $n - f$ base objects have at least one base object
in common. Therefore, $w_2$ gets a response from at least one base object in $S$ in its first phase, and thus
sets $storedTS = ts'$ s.t. $ts' \ge ts_1$.

□ 

Lemma 11. For any run $r$ of the algorithm, for any base object $bo$ at any time $t$ in $r$, $bo.V_p$ does not
store more than one piece of the same write.

Proof. Each write performs the second phase at most once on each base object $bo$, and in each update
it stores at most one piece in $bo.V_p$. Since writes do not store pieces in $bo.V_p$ during the third phase, the
lemma follows.

□ 

Lemma 12. Consider a run $r$ of the algorithm in which the maximum number of concurrent writes is
$c \le k - 1$. Then the storage at any time in $r$ is not bigger than $(2f + k)(c + 1)D/k$ bits.

Proof. Recall that we assume that $n = 2f + k$ and the size of each piece is $D/k$. Thus it suffices to show
that there is no time $t$ in $r$ s.t. some base object stores more than $c + 1$ pieces at time $t$.

Assume by way of contradiction that the claim is false. Consider the time $t$ when some $bo \in N$ stores
$c + 2$ pieces for the first time. Notice that $|bo.V_p| \le c + 1 \le k$ until time $t$, and therefore $bo.V_p$ does not
contain more than one piece from the same write, and $bo.V_f = \{\}$ until time $t$. Now consider the write
$w$ that was invoked last among all the writes that store pieces in $bo.V_p$ at time $t$, and denote its piece by $p$.



when p was added. A contradiction. 


□

Lemma 13. The storage is never more than $(2f + k)2D$ bits at any time $t$ in any run $r$ of the algorithm.

Proof. Each base object stores no more than $2k$ pieces at any time $t$ in $r$. The lemma follows.

□ 
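The arithmetic behind Lemma 13 can be spelled out as follows, assuming (as in the pseudocode) that line 35 admits a new piece into $V_p$ only while it holds fewer than $k$ pieces and that $V_f$ holds the $k$ pieces of a single value, each piece being $D/k$ bits:

```latex
\underbrace{(2f + k)}_{n \text{ base objects}} \cdot
\underbrace{(k + k)}_{|V_p| + |V_f| \text{ pieces per object}} \cdot
\frac{D}{k} \;=\; (2f + k)\,2D .
```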

Lemma 14. Consider a run $r$ of the algorithm with a finite number of writes, in which all writers are correct.
Then the storage is eventually reduced to $(2f + k)D/k$ bits.

Proof. Consider a write $w$ with the biggest timestamp $ts$ in $r$. Since $w$'s writer is correct, and since writes are
wait-free, $w$ returns, and eventually performs free on every base object. Consider a base object $bo$ s.t. $w$
performs free on $bo$ at time $t$. Notice that $w$ deletes all pieces with timestamps smaller than $ts$ and sets
$bo.ts = ts$ at time $t$. Now recall that $bo$ ignores all updates with a timestamp less than $bo.ts$, and therefore
$bo$ stores only $w$'s piece at any time after time $t$. The lemma follows.

□ 


From Lemmas 12, 13, and 14 we get:

Corollary 4. The storage of the algorithm is bounded by $(2f + k)2D$ bits, and in runs with at most $c < k$
concurrent writes the storage is bounded by $(c + 1)D/k$ bits per base object, that is, $(2f + k)(c + 1)D/k$ bits
overall. Moreover, in a run with a finite number of
writes, if all the writers are correct, the storage is eventually reduced to $(2f + k)D/k$ bits.


We now prove the liveness property.


Lemma 15. Consider a fair run $r$ of the algorithm. Then every write $w$ invoked by a correct client $c_i$
eventually completes.

Proof. Consider a correct client $c_i$. The write $w$ is divided into three phases s.t. in each phase, $c_i$ invokes
operations on all the base objects and waits for $n - f$ responses. The run $r$ is fair, so every action invoked
by $c_i$ on a correct base object eventually returns, and no more than $f$ base objects fail in $r$. Therefore,
$c_i$ eventually receives $n - f$ responses in each of the phases and returns.

□ 


Observation 6. When a piece from bo.Vp is deleted, bo.ts is increased. 

Lemma 16. If at time $t$, $c_i$ completes the second phase of a write with timestamp $ts$, then for every $t' > t$
and for every $S \subseteq N$ s.t. $|S| \ge n - f$, there exists a write $w$ with timestamp $ts' \ge ts$ s.t. at least $k$ pieces of $w$ are stored in $S$.

Proof. Consider time $t'$. Let $ts$ be the highest timestamp written by a write $w$ that completed the second
phase by time $t$. It is sufficient to show that the lemma holds for $ts$.

First note that $\forall bo$, $bo.ts \le ts$ before time $t$, because no write with a larger timestamp than $ts$ started
the third phase. This means that $w$'s update left at least one piece in every $bo$ on which it occurred. Now consider
a set $S$ of $n - f$ base objects. Since $n = 2f + k$, $w$'s update occurred in a set $S'$ that contains at least
$k$ base objects in $S$.

If $w$ wrote to $V_p$, its piece was not overwritten by time $t$, because (1) no other write began free with a
timestamp bigger than $ts$, and (2) since there is no base object $bo$ s.t. $bo.ts > ts$, no write deletes $w$'s piece
in the second phase. Therefore, if $w$ wrote to $V_p$ in all base objects in $S'$, the lemma holds.

Otherwise, $w$ wrote $k$ pieces to $V_f$ in base objects in some set $S'' \subseteq S'$. Consider two cases. First,
there is a base object $bo' \in S''$ s.t. some write overwrote $w$'s pieces in $bo'.V_f$ before time $t$. Since there is
no write with a timestamp bigger than $ts$ that started the third phase before time $t$, it is guaranteed that
$k$ pieces with timestamp $ts' > ts$ are stored in $bo'.V_f$ at time $t$, and the lemma holds. Otherwise, since $w$'s pieces
stored in $S' \setminus S''$ were not overwritten before time $t$, the lemma holds (no matter whether $w$ performed the third
phase or not).

□ 

Invariant 1. For any run $r$ of the algorithm, for any time $t$ in $r$, and for any set $S$ of $n - f$ base objects, let
$ts_S = \max\{bo.ts \mid bo \in S\}$. Then there is a timestamp $ts' \ge ts_S$ s.t. there are at least $k$ different pieces
associated with $ts'$ in $S$.

Proof. We prove by induction. Base: the invariant holds at time 0. Induction: assume that the
invariant holds before an action is scheduled at time $t$; we show that it also holds at time $t$. Assume that the
action is an RMW on a base object $bo$, and consider any set $S$ of $n - f$ base objects. If $bo \notin S$, then the
invariant holds. Otherwise, consider the two possible RMW actions:


The action is update. If no pieces are deleted, the invariant holds. If $bo.ts$ is increased, then
consider the write with timestamp $ts$ that is the biggest timestamp among all writes that
complete the second phase before time $t$. Notice that $bo.ts \le ts$ at time $t$, and by Lemma 16, the
invariant holds. The third option is that a piece $p$ with timestamp $ts' > bo.ts$ of a write $w$ is deleted
and $bo.ts$ is not increased. Note that by Observation 6, such a piece can be deleted only from $bo.V_f$,
and since $p$ is overwritten by $k$ pieces with a bigger timestamp, the invariant holds.

The action is free. If $bo.ts$ is not changed, then the invariant holds. Otherwise, consider the write
with the biggest timestamp $ts$ among all writes that complete the second phase before time $t$. Note
that $bo.ts$ is set to a timestamp $ts' \le ts$, so by Lemma 16 the invariant holds.


□ 


Lemma 17. Consider a fair run $r$ of the algorithm. If there is a finite number of write invocations in $r$,
then every read operation $rd$ invoked by a client $c_i$ eventually returns.


Proof. Assume by way of contradiction that $rd$ does not return in $r$. By Lemma 15, the writes are wait-free,
and since the number of write invocations in $r$ is finite, there is a time $t$ in $r$ s.t. no write performs
actions after time $t$. Therefore, any read that invokes the readValue() procedure after time $t$ receives a set
$S$ of values that is stored in a set of $n - f$ base objects at time $t$. By Invariant 1, there is a timestamp
$ts$ s.t. there are at least $k$ different pieces in $S$ associated with $ts$, and $ts \ge bo.ts$ for all $bo \in S$. Now, since
every correct read $rd$ invokes readValue() infinitely many times in $r$, $rd$ returns. A contradiction.

□ 


The next corollary follows from Lemmas 15 and 17.


Corollary 5. The algorithm satisfies the FW-termination property.

We now prove that the algorithm satisfies strong regularity.


Definition 4. For every run $r$, $\sigma_r$ is a sequential run s.t. the writes in $r$ are ordered in $\sigma_r$ by their
timestamp, and every read in $r$ that returns a value associated with timestamp $ts$ is ordered in $\sigma_r$
immediately after the write that is associated with timestamp $ts$.

For simplicity, we say that $v_0$ was written by a write $w_0$ that is associated with timestamp 0 at time 0.

Lemma 18. Consider a run $r$ and a read $rd$ that returns a value $v$. Consider also the timestamp $ts'$ that
$rd$ obtains in line 20 (Algorithm 3). Then $v$ is the value written by a write associated with timestamp $ts'$,
or $v_0$ if $ts' = 0$.

Proof. By the code, if $ts' = 0$, then $rd$ returns $v_0$. Now notice that $rd$ obtains at least $k$ different pieces
associated with timestamp $ts'$, thus by the definition of decode, $rd$ returns $v$.

□


Corollary 6. For every run $r$, $\sigma_r$ satisfies the sequential specification.

Observation 7. Consider a write $w$ that obtains $ts$ and $storedTS$ in the first phase. Then $ts > storedTS$.

Lemma 19. For every run $r$, for every two writes $w_1, w_2$ with timestamps $ts_1, ts_2$: if $w_2$ was invoked
after $w_1$ finished the second phase, then $ts_1 < ts_2$.

Proof. First notice that for every base object $bo$, if a write $w$ overwrites pieces of a write $w'$ in $bo.V_f$,
then $w$'s timestamp is bigger than $w'$'s. And by Observation 7, if $w$ deletes $w'$'s piece from $bo.V_p$, then it
stores a piece with a bigger timestamp than $w'$'s timestamp. Therefore, the maximal timestamp in each
base object is monotonically increasing. Now recall that in the second phase $w_1$ performed update on
$n - f$ base objects, and notice that after $w_1$ performs update on a base object $bo$, the maximal timestamp
in $bo$ is at least as big as $ts_1$. Now since two sets of $n - f$ base objects have at least one base object in
common, $w_2$ picks $ts_2 > ts_1$.

□ 


Lemma 20. For every run $r$, for every two writes $w_1, w_2$ in $r$, if $w_1 \prec_r w_2$, then $w_2$ is not ordered before
$w_1$ in $\sigma_r$.


Proof. Follows immediately from Lemma 19.


□ 


Lemma 21. For every run $r$, for every read $rd$ and write $w_1$, if $rd \prec_r w_1$, then $w_1$ is not ordered before
$rd$ in $\sigma_r$.


Proof. Assume that $rd$ returns a value that is associated with timestamp $ts$ belonging to some write $w$,
and $w_1$ is associated with timestamp $ts_1$. Since $rd$ returns $w$'s value, $w$ begins the third phase before $rd$
returns. And since $w_1$ was invoked after $rd$ returns, $w_1$ was invoked after $w$'s second phase. Therefore,
by Lemma 19, $ts_1 > ts$, and thus $w_1$ is ordered after $w$ in $\sigma_r$. Recall that by the construction of $\sigma_r$, $rd$
is ordered immediately after $w$ in $\sigma_r$; hence, $rd$ is ordered before $w_1$ in $\sigma_r$.

□ 


Lemma 22. For every run $r$, for every read $rd$ and write $w_1$, if $w_1 \prec_r rd$, then $rd$ is not ordered before
$w_1$ in $\sigma_r$.

Proof. Consider a write $w_1$ with timestamp $ts_1$ and a read $rd$ s.t. $w_1 \prec_r rd$. Assume by way of contradiction
that $rd$ is ordered before $w_1$ in $\sigma_r$. Then $rd$ returns a value with a timestamp $ts$ that is
associated with a write $w$ that is ordered before $w_1$ in $\sigma_r$. By the construction of $\sigma_r$, $ts_1 > ts$. Now,
since $w_1$ completed the third phase before $rd$ was invoked, and since by Observation 5, for each $bo$, $bo.ts$ is
monotonically increasing, when $rd$ is invoked, for every set $S$ of $n - f$ base objects, the maximal $bo.ts$ of
all $bo \in S$ is bigger than or equal to $ts_1$, and thus bigger than $ts$. Therefore $rd$ sets $storedTS$, in the first phase,
to a timestamp bigger than $ts$, and thus does not return $w$'s value. A contradiction.

□ 


The next corollary follows from Corollary 6 and Lemmas 20, 21, and 22.


Corollary 7. The algorithm simulates a strongly regular register.


The following theorem stems from Corollaries 4, 5, and 7.

Theorem 3. There is an FW-terminating algorithm that simulates a strongly regular register, whose
storage is bounded by $(2f + k)2D$ bits, and in runs with at most $c < k$ concurrent writes, the storage
is bounded by $(c + 1)D/k$ bits per base object, that is, $(2f + k)(c + 1)D/k$ bits overall. Moreover, in a run
with a finite number of writes, if all the writers are
correct, the storage is eventually reduced to $(2f + k)D/k$ bits.